
Initial integration of TUF client into Nexus #469

Closed · iliana wants to merge 14 commits

Conversation

@iliana (Contributor) commented Dec 1, 2021

RFD 183 recommends the use of TUF as a transport mechanism to get updates from Oxide to the rack for a few reasons, one of which is "I wrote a TUF client once already". This is the initial integration work to add tough to Nexus.

I am now, of course, paying the costs of decisions I made two years ago. tough is not yet async (since it was written right as async/.await was starting to stabilize), RepositoryBuilder and Repository are not Send, and tough has a defensively-built API, intended for systems that have few mechanisms to handle threats to an update system. So there's this awkward tokio::task::spawn_blocking boundary between tough and the rest of Nexus.
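
Roughly, that boundary has the following shape (a sketch only; the function names and types here are placeholders, not the actual Nexus code): all tough work stays in a synchronous function, and the async side only reaches it through spawn_blocking.

use tokio::task;

// All tough work (building the Repository, reading artifacts.json) stays in a
// synchronous function, since Repository is not Send and reqwest's blocking
// client must not run on an async runtime thread.
fn read_artifacts_blocking() -> Result<Vec<String>, String> {
    // placeholder for: load the TUF repo with tough and parse artifacts.json
    Ok(vec!["cockroach".to_string()])
}

async fn updates_refresh() -> Result<Vec<String>, String> {
    // the async side crosses the boundary via spawn_blocking
    task::spawn_blocking(read_artifacts_blocking)
        .await
        .map_err(|e| e.to_string())?
}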

Note the update URLs are currently hardcoded to paths on http://localhost:8000 because Nexus doesn't fetch the Rack from the database yet.

Quick top-level overview

artifacts.json is a target in the TUF repository we build. Each entry describes an update available for a specific "thing" (in this case, the not-yet-existing CockroachDB zone): a name, an integer version number, what kind of thing it is (in this case, a zone), and the name of the target as listed in the targets.json TUF role.

{
    "version": "1",
    "artifacts": [
        {
            "name": "cockroach",
            "version": 1,
            "kind": "zone",
            "target": "whatever.tar.gz"
        }
    ]
}
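
For illustration, the corresponding Rust types might look roughly like this when deserialized with serde (the type names, and using a plain String for kind, are illustrative rather than the actual Nexus types):

use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct ArtifactsDocument {
    // the top-level "version" is a string in the example above
    version: String,
    artifacts: Vec<UpdateArtifact>,
}

#[derive(Deserialize, Debug)]
struct UpdateArtifact {
    // name of the "thing" being updated, e.g. "cockroach"
    name: String,
    // integer version of that thing
    version: i64,
    // what kind of thing it is, e.g. "zone"
    kind: String,
    // name of the target in the targets.json TUF role
    target: String,
}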

This code adds an /updates/refresh endpoint to more-or-less copy the data from this file (and the target's checksum/size) into a database table.

The goal is for Nexus to be aware of what updates are available for different "things", and be able to download those while adhering to the TUF specification (not fetching if the metadata is expired, and checking the checksum and size), but not having to re-fetch the metadata every time it wants to fetch an artifact.

(I am going to work on artifact fetching next, and will either append it to this PR or put it in a follow-up PR.)

nexus is async; tough is not, but tough uses reqwest in blocking
mode, which wraps the async client and must not run inside an async runtime.

https://docs.rs/reqwest/0.11.7/reqwest/blocking/index.html:

> Conversely, the functionality in `reqwest::blocking` must not be
> executed within an async runtime, or it will panic when attempting to
> block.

moves the relevant code out to updates.rs, since the function can't
borrow self due to lifetime constraints.
@iliana requested a review from @smklein, December 1, 2021 20:57
@smklein (Collaborator) left a comment

Awesome start, super cool to see this coming together!

I think your judicious use of spawn_blocking to decouple from the tough async constraints was a good call - the rest of the PR looks mostly unaffected by that part of the codebase.


[updates]
# If not present, accessing the TUF updates repository will fail
#tuf_trusted_root = "/opt/oxide/nexus/pkg/root.json"
Collaborator:

Where would this root.json come from? I'm noticing this part of Nexus config present-but-commented-out everywhere, sorta wondering what the plan is gonna be for getting this running in a dev/test environment.

@iliana (author) commented Dec 1, 2021:

I should probably write up a script for generating one of these, but copying and pasting from history:

tuftool root init 1.root.json
tuftool root gen-rsa-key 1.root.json ~/test.key -r root snapshot targets timestamp
[ manually edit a few things in 1.root.json, namely expiration and key thresholds ]
tuftool root sign -k ~/test.key 1.root.json

then creating a repository from the files in ~/in:

tuftool create -k ~/test.key --outdir repo --root 1.root.json \
    --snapshot-expires 'in 3 days' --targets-expires 'in 3 days' --timestamp-expires 'in 3 days' \
    --snapshot-version $(date +%s) --targets-version $(date +%s) --timestamp-version $(date +%s) \
    --add-targets ~/in

Comment on lines +1877 to +1879
/**
* Refresh update metadata
*/
Collaborator:

This seems like a good candidate for an addition to https://github.com/oxidecomputer/omicron/blob/main/tools/oxapi_demo , as a convenience

}]
async fn updates_refresh(
    rqctx: Arc<RequestContext<Arc<ServerContext>>>,
    // _query_params: Query<PaginatedById>,
Collaborator:

Dead code?

Comment on lines 2277 to 2278
// first, the JoinError:
.map_err(|e| Error::InternalError { internal_message: e.to_string() })?
Collaborator:

This only happens when the call to read_artifacts panics, right?

  1. I think we're using panic=abort, so this shouldn't happen,
  2. ... if it did, panicking ASAP seems like the right behavior.

In other codebases, I just unwrapped the first error for this reason: https://github.com/oxidecomputer/async-bb8-diesel/blob/09f17db081be85abdf4f773a385381637b9c8d76/src/connection.rs#L42-L45
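
A minimal sketch of that suggestion (placeholder function and error types, not the actual Nexus error handling): treat the JoinError as a panic to propagate rather than an error to wrap.

use tokio::task;

// Placeholder for the blocking tough call.
fn read_artifacts() -> Result<Vec<String>, String> {
    Ok(vec![])
}

async fn refresh() -> Result<Vec<String>, String> {
    task::spawn_blocking(read_artifacts)
        .await
        // A JoinError here only means read_artifacts panicked (or the task
        // was cancelled), so panic instead of mapping it to an InternalError.
        .expect("read_artifacts task panicked")
}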

.map_err(|e| Error::InternalError { internal_message: e.to_string() })?
// next, the boxed dyn Error:
.map_err(|e| Error::InternalError {
    internal_message: e.to_string(),
Collaborator:

Nit: Could we add a little context here - just something like "we were trying to refresh updates, and saw {}"

Comment on lines +2284 to +2286
// FIXME: if we hit an error in any of these database calls, the available artifact table
// will be out of sync with the current artifacts.json. can we do a transaction or
// something?
Collaborator:

> can we do a transaction or something

Yeah, there's an example here: https://github.com/oxidecomputer/async-bb8-diesel/blob/22c26ef840075a57b18ac2eae3ead989b992773e/examples/usage.rs#L72-L82

If we wanted to do this, we would probably have a call to self.db_datastore that takes all the artifacts we wanna upsert at once, so the transaction could be self-contained in the datastore function.

All that being said... the upserts are idempotent, right? We could have a separate single-column table for targets_role_version, which we update once all the necessary artifacts are updated. Then the steps would be:

  1. upsert all the artifacts
  2. update the "desired" targets role version
  3. delete all older artifacts

And I don't think they'd need to be transactional anymore.
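
As a rough sketch of that ordering (a much-simplified in-memory stand-in for the datastore, just to show the three steps; the real thing is CockroachDB via the Nexus datastore):

use std::collections::BTreeMap;

#[derive(Clone, Debug)]
struct Artifact {
    name: String,
    targets_role_version: i64,
}

#[derive(Default)]
struct Store {
    artifacts: BTreeMap<String, Artifact>,
    desired_targets_role_version: Option<i64>,
}

impl Store {
    fn refresh(&mut self, new_artifacts: Vec<Artifact>, version: i64) {
        // 1. upsert all the artifacts (idempotent, safe to retry)
        for artifact in new_artifacts {
            self.artifacts.insert(artifact.name.clone(), artifact);
        }
        // 2. update the "desired" targets role version
        self.desired_targets_role_version = Some(version);
        // 3. delete all artifacts from older targets role versions
        self.artifacts.retain(|_, a| a.targets_role_version == version);
    }
}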

Either way, if you wanna punt on this until after the PR is merged, could we file a bug to track? DB concurrency shenanigans are hairy, but I don't want us to lose track of 'em.

Collaborator:

How is this function called? If it's from an API endpoint, and the API request fails, then at least the caller knows it didn't work and we're basically saying it's their responsibility to retry until it works.

Is it also a problem that it might be inconsistent? That is, we might have updated some of the artifacts but not all of them?

@iliana (author):

If it's inconsistent, then either Nexus will be missing information about a newer update, or might try to update to a removed artifact (if we ever have to do that for some reason).

Comment on lines +1984 to +1985
tuf_metadata_base_url: "http://localhost:8000/metadata".to_string(),
tuf_targets_base_url: "http://localhost:8000/targets".to_string(),
Collaborator:

These might also be good candidates for the Nexus config file?

Also, should we have instructions on "how to set this up"? I don't think these servers are running on most people's machines.

@iliana (author):

I've just been running python3 -m http.server in the output directory from tuftool. (Instructions for making a repo probably should go along with that.)

These should maybe go in the Nexus config file to start, but they'll eventually need to be configurable by an operator... thoughts?

Collaborator:

What's the longer term plan here? Are we building a new component to serve this endpoint? Is it part of the control plane? If not, maybe we want to treat this like we do Clickhouse and CockroachDB, with easy ways to run it by hand that we use in the test suite to get coverage of this stuff?

@iliana (author):

Longer-term plan is for this to default to something.oxide.computer via HTTPS and be stored in the database, ready for an end-user to change it if they need to.


let mut current_version = None;
for artifact in artifacts {
    current_version = Some(artifact.targets_role_version);
Collaborator:

As a sanity-check -- if we see multiple values of artifact.targets_role_version, they should all be the same, right? (trying to mentally verify that the order of artifacts couldn't change behavior)

@iliana (author):

Yeah, these will all be the same.

This is kind of messy, it's a last minute "well crap I need this value" that we can clean up when moving the upsert and delete operations into the same transaction.

.update_available_artifact_hard_delete_outdated(current_version)
.await?;
}

Collaborator:

When do you think we'd actually try to apply updates (basically, making the calls to POST /update on each sled) to the rest of the rack? Would that be at the end of this function, or some other time?

@iliana (author):

For the demo, probably immediately just for simplicity's sake?

At release I think a background task that periodically checks for out-of-date artifacts on the rack and starts applying updates is probably correct. Otherwise we need to handle applying updates here as well as when a sled-agent comes online (e.g. brand new sled, or sled that's been powered off for a few weeks).
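
For what it's worth, a sketch of what such a background task might look like (the interval and function names are made up for illustration; this would be spawned once at Nexus startup with tokio::spawn):

use std::time::Duration;
use tokio::time;

// Hypothetical periodic checker: refresh update metadata, then apply updates
// to anything that's out of date. Both steps are placeholders here.
async fn update_check_task() {
    let mut ticker = time::interval(Duration::from_secs(300));
    loop {
        ticker.tick().await;
        if let Err(e) = refresh_and_apply_updates().await {
            eprintln!("update check failed: {e}");
        }
    }
}

async fn refresh_and_apply_updates() -> Result<(), String> {
    // placeholder for: refresh the available-artifact table from the TUF repo,
    // compare against what each sled is running, and POST /update where needed
    Ok(())
}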

pub name: String,
pub version: i64,
pub kind: UpdateArtifactKind,
pub target: String,
Collaborator:

(... I feel like I may have asked you this before, but I can't remember the answer)

What is this field? Is it a reference to the targets.json file somehow? If it's the name of the artifact, how is it different from name?

Related: Could we add some docs?

@iliana (author):

It's the name of the target in targets.json, yeah. I should do a docs / "is this name sensible" pass again.
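
For the docs pass, something along these lines might be enough (the struct name and wording are illustrative; the fields match the snippet above):

// stub for illustration
pub enum UpdateArtifactKind {
    Zone,
}

pub struct UpdateArtifact {
    /// User-facing name of the thing being updated, e.g. "cockroach".
    pub name: String,
    /// Version of the artifact itself (not the TUF targets role version).
    pub version: i64,
    /// What kind of thing this artifact updates, e.g. a zone.
    pub kind: UpdateArtifactKind,
    /// Name of the corresponding target in the targets.json TUF role, used to
    /// locate and verify the artifact in the repository.
    pub target: String,
}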

@davepacheco (Collaborator) left a comment

As Sean said, this looks like a great start!

mode = "file"
path = "logs/server.log"
if_exists = "append"
#mode = "file"
Collaborator:

This file was supposed to be exactly the same as nexus/examples/config.toml, except it logs to a file instead of stdout. I think maybe you didn't mean to comment out this section?

pub struct UpdatesConfig {
    /** Trusted root.json role for the TUF updates repository. If `None`, accessing the TUF
     * repository will fail. */
    pub tuf_trusted_root: Option<PathBuf>,
Collaborator:

Is there any deployment where we might want to leave this out? (i.e., why not require this? is it a pain for development?)

@iliana (author):

I imagine this could be temporary, but given all the work going on right now I didn't want to break folks' testing workflows. There should not be production deployments where this is somehow left unconfigured.

@@ -330,6 +340,7 @@ mod test {
                 .parse()
                 .unwrap()
             },
+            updates: UpdatesConfig { tuf_trusted_root: None },
Collaborator:

I think this test fills everything out instead of leaving anything optional. It seems useful to have at least one test that does that for the updates section.

@@ -142,8 +144,8 @@ impl Nexus {
* Create a new Nexus instance for the given rack id `rack_id`
*/
/* TODO-polish revisit rack metadata */
pub fn new_with_id(
rack_id: &Uuid,
pub async fn new_with_id(
Collaborator:

Wondering out loud: I've been trying to keep this sync because it's tempting to put async things in here that belong as background processes (e.g., saga recovery or database initialization). I don't fully grok the flow of the root JSON file but it looks like this is reasonable to be done immediately. But I wonder if the (async) caller should pass in the file contents instead?

@iliana (author):

We could also keep it sync and just use std::fs::read. Root roles are generally expected to be very small (hence why we're just keeping it in memory).

Alternatively we can just pass along the PathBuf of the root to Nexus, and Nexus can read it whenever /updates/refresh is hit.
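
Both options are pretty small; a sketch (paths and names illustrative):

use std::path::{Path, PathBuf};

// Option 1: keep new_with_id sync and read the (small) trusted root up front.
fn load_trusted_root(path: &Path) -> std::io::Result<Vec<u8>> {
    std::fs::read(path)
}

// Option 2: hand Nexus the path and read it each time /updates/refresh runs.
struct UpdatesContext {
    trusted_root_path: PathBuf,
}

impl UpdatesContext {
    fn read_trusted_root(&self) -> std::io::Result<Vec<u8>> {
        std::fs::read(&self.trusted_root_path)
    }
}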

@iliana closed this Mar 3, 2022
@iliana deleted the nexus-update-refresh branch May 16, 2024 21:54